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1 Introduction 

Majorization algorithms or MM algorithms are an increasingly popular class of tech- 
niues for iteratively solving the problem of minimizing a real valued function / over a set 
X C M n . Some of the pertinent early references are De Leeuw (1994) and W. Heiser (1995), 
and there are recent book-length treatments in De Leeuw (2016) and Lange (2016). 

Our treatment here is more general than the usual one. A majorization algorithm is defined 

by a majorization scheme, which is 

• a point-to-set map Z : X X with y G Z(y) for all y G X. 
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• a function g : X ® X =>■ M such that f(x) < g(x,y ) for all y G X and x G Z(y) and 

g(y,y) = f{y) for all ytX- 

The update map of a majorization scheme is 

u(y) = argmin {g(x, y) \ x G Z(y)}. (1) 

Iteration k + 1 of a majorization algorithm updates the current solution rr^by the rule 

• if x^ G u;(x^) stop, 

• else x( k+1 ) G c j(x^). 

The algorithm is stable, in the sense that it either stops at a fixed point x G u(x) or, by the 

sandwich inequality, 

/0r (fc+1) ) < g(x (k+1 \x (k) ) < g{x( k \x {k) ) = f(x (k) ) 

an iteration strictly decreases the objective function /. Our definitions are more general than 
the usual ones, because Z(x can be different for each k and we only require fix) < g(x, x^) 
for x G Z(x^). 

Convergence of majorization algorithms to a fixed point x G oj(x) is guaranteed under 
conditions that are satisfied in many applications (De Leeuw (2016), Lange (2016)). For 
most majorization algorithms used so far convergence is linear, with a convergence rate equal 
to the spectral norm, i.e. the eigenvalue of maximum modulus, of the derivative T>uj(y) at 
the fixed point y. This paper is about, computing that derivative in several majorization 
situations. 

There is a huge literature on derivatives of the update map in the area of parametric 
mathematical programming, where the update map is usually called the solution map. 
Good overviews of the initial developments of the field are in Bank et al. (1982) and Fiacco 
(1983). Levitin (1994) is a truly virtuoso and exhaustive treatment, primarily using the 
tools of classical analysis. Recent modern book length treatments, using the more abstract 
framework of modern variational analysis, are in Bonna.ns and Shapiro (2000), Luderer, 
Minchenko, and Satsura (2002), and Dontchev and Rockafellar (2014). Excellent introductory 
overviews of the field are Giorgi and Zuccotti (1918) and the lecture notes of Still (2018). 

2 No Constraints 

If there are no constraints on x the derivative of the update map in a majorization algorithm 
follows from a straightforward application of the implicit function theorem (IFT). More 
precisely, the IFT guarantees the existence of the derivative, and then the actual derivative 
is easily computed. The result below, which applies the IFT to majorization algorithms, is 
taken from De Leeuw (2016). 

Define Z{y) = M n for all y. We assume that g is two times continuously differentiable in a 
neighborhood of the fixed point (u(y),y). 
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At a minimum 


Vig{u{y),y) = 0 


Differentiating again 

g(u,y)T>u(y) + V 12 g{u,y) = 0, 

and thus, assuming the inverse exists, 

Vu(y) = -[V u g{uj{y),y)}- l V l2 g(uj{y),y). 

So far, nothing new. This is just the usual formula for the derivatives of an implicit function. 
But in our case g is a majorization scheme for /. Thus g(x,y) — f(x ) has a minimum over x 
at y, with minimum value zero. This implies that 

Vig{y,y) = Vf(y) 

for all y, and thus, assuming / is also twice differentiable, 

£>11 g(y, y ) + V 12 g(y, y) = V 2 f(y), 

Majorization also implies that T> u g(y,y) >z V' 2 f(y) in the Loewner sense, i.e. T>ug(y,y) — 
T> 2 f(y) is positive semi-definite. Thus T>i 2 g(y,y) is symmetric and negative semi-definite. 

Combining majorization with the derivative from the IFT shows that at a fixed point y, 
Vu(y) = -[D^y.y^V^y.y) = I - [D n g(y, y)}~ L V 2 f(y). 


2.1 Example: SMACOF 

The Euclidean least squares metric multidimensional scaling problem is the minimization, 
after a change of coordinates detailed in De Leeuw (1993), of the function 

K r - 1 

fix) = 1-53 w k 8 k \]x'A k x + -x'x, 

k=l ~ 


where the are positive semi-definite matrices of order n that add up to the identity. The 
w k and 5 k are positive weights and dissimilarities. It follows that, provided x'A k x > 0 for 
all k , that 

Vf(x ) = x — B{x)x , 


where 


Also 


K 


B(x)±'£w k -f^=A, 

vx'A k x 


V 2 f(x) = I-H(x), 


where 


K 


H ( X ) = Y1 W k 


k =1 


8k 

\Jx’ A k x 



A k xx'A k 

x'A k x 
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The SMACOF majorization algorithm (De Leeuw (1977), De Leeuw and Heiser (1977), De 
Leeuw and Heiser (1980), De Leeuw (1984), De Leeuw (1988), Borg and Groenen (2005), De 
Leeuw and Mair (2009)) uses the Cauchy-Schwartz inequality in the form 

^ - vm x ' Aky - 

This implies that g, defined as 

g(x, y) = 1 - x'B(y)y + ]^x'x 
majorizes / at y, which leads to the algorithm 

x {k+1) = B(x (k) )x {k) . 

By direct calculation the derivatives of the algorithmic map at a fixed point y is H(y). From 
our previous implicit function formulas we get 

Vco(y) = I - [V ll9 (y, y)]- l V 2 f{y) = I - - H(y )) = H(y), 

while also V l2 g(y,y) = -H(y). 

3 Equality Constraints 

Suppose hj : M n M are m continuously differentiable functions, and 

X = {x e M n | hj{x) = 0}. 

We assume that for each y the minimum of g over x G X is unique and attained at u(y), 
and that the m vectors T>hj{u{y )) are linearly independent. At a solution of the stationary 
equations we have 


£*i g{v{y),y) + X! x j(y) vh j( u (y)) = 

3 =1 

h j(u(y)) = 0 , 

where the A j are Lagrange multipliers, which are uniquely defined because of the linear 
independence assumption. 

Define 

m 

A = V n g(w(y),y) + X \{y)^ 2 hj{u{y)), 

3 =1 

and 
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B^hWy)), 

E = D 12 g(u(y),y), 
F = Duj(y), 

G = V\(y). 

Differentiating the stationary equations gives 


£>11 g(u(y),y) + E'JU \j{y)V 2 hj{u{y)) £>1 h{u{y)) 

Vuj(y) 


-V v2 g(u(y),y) 

V 1 h(u(y)) 0 

VX (y)_ 


0 


AF + B'G = -E, 
BF = 0, 

which has the solution, assuming A is invertible, 


G = -(BA~ 1 B')~ 1 BA~ 1 E, 

F = -A- 1 (I - B'(BA~ 1 B')~ 1 BA~ 1 )E. 

The convergence rate of the algorithm is the eigenvalue of F with largest modulus, evaluated 
at the fixed point where ui(y) = y. 

Also, majorization of / by g implies that T>\g(x,x) = T>f{x ) for all x, and thus at a fixed 
point 

-E = V l2 g(x , x) = V u g(x, x ) - V 2 f(x). 

This, in turn, implies that —E is symmetric and positive semi-definite. 

3.1 Example: Power Method 

Consider the problem of minimizing / with f(x) = —^x'Vx, with V positive definite of order 
n, over all x satisfying x'x = 1 and Ux — 0, with U an m x n matrix of rank m. 

Since —\(x — y)'V(x — y) < 0 we see that g with g{x,y) = \y'Vy — x'Vy is a suitable 
majorization function. In this case we can derive an explicit formula for the update map, and 
consequently we can check if the results correspond with the ones derived from our version of 
the IFT for the Langrangian. 

The Langrangian stationary equations for solving the majorization iterations are 
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—Vy + Xx + U' n = 0, 

/ 1 

X X = 1, 

ffir = 0. 

Thus n = {UU')~ l UVy and QVy = Xx, with Q the orthogonal projector 

Q^I -U'(UU')~ l U. 

We find A = \Jy'V QVy, and the update map is 

w(y) = jQVy. 

Thus the majorization algorithm is the power method for the matrix QV, which converges to 
its largest eigenvalue and corresponding right eigenvector. 

Taking derivatives of the update map 

Vcj(y) = \{qV~ ^ QVyy'VQV } , 

At a stationary point QVy = Xy, and thus 

Vu(y) = j{QV-yy'V} 

Note that if y is a right eigenvector of QV then Vy is a left eigenvector. The convergence 
rate of the majorization algorithm is the ratio of the second largest to the largest eigenvalue 
of QV. 

In addition we find Vy(y) = (UU , )~ 1 UV and T>X(y) = \VQVy, which is VX(y) — Vy at a 
stationary point. 

Now let’s see if we get the same solution by solving the system 


AF + B'G = -E, 
BF = 0. 


For our example 


A — XI, 

B = 

x' 

U 

E = 

-V, 
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We must have 


XF + 


x U' 


G = V, 


and 


F — 0. 


For the derivatives of the Lagrange multipliers we find 

G = 


x'V 

(yu')~ l uv 


which corresponds with our results using the explicit formulas. Also 


A F = V 


X u> G = V - xx'V -UiUU'y'UV = QV - xx'V, 


which is indeed what we found using the explicit expression for to. 


3.2 Example: Least Squares with Quadratic Constraint 

Consider the problem of minimizing 

f(x) = ^{x~u)'V(x-u) 


over all x satisfying | x'Wx = 1. Here W is assumed to be positive definite. This problem, or 
some variation of it, has been studied by many authors, starting with Forsythe and Golub 
(1965), Spjptvoll (1972), Gander (1981). 

We use majorization to construct an iterative algorithm. Suppose k is such that V F kW. 
Then 

f(x) < f(y) + (x - y)'V{y - z) + \{x - y)'W{x - y). 

We can minimize the majorization function over | x'Wx = 1 by minimizing the linear function 
x'V(y — z) — kx'W y over A x'Wx = 1. The minimum is attained for 

x oc y - W~ l V{y — z), 

Kj 


and we hnd the update by normalizing. The derivative of the update function, computed 
directly, is 


Vu(y) 


yy'w 

y'Wy 


I--W~ l V 

K 


The Lagrangian equations for the majorization problem are 


V(y-z)~ nWy + A (y)Wu(y) = 0, 
\o(y)'Wu(y) = 1 
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Thus 


A \y)Wu(y) + A (y)WVu(y) = 0, 
u(y)'WVcu(y) = 0. 


Vu,(v) = 

Thus A \y) = u(y)'Wu(y) and A (y) = Vu(y)'WVu(y). 

The IFT with 


A = XW 
B = Wx, 

E = V — kW 


f = -(xwyUi-x'wiWxixwy'xwy^xixwy^iv-KW) = -a ~ 2 w~\i — -^—xx'wy l x 'w- l ( 

x'W l x 

4 Parametric Equality Constraints 

A more general problem is to compute, and study, 

u(y) = argmin {g(x , y) \ H(x , y) = 0}, 

m 

x(y) = {i 6 r V hj(x,y ) = 0}. 

3 = 1 

where the constraints also depend on the parameter y.The relevance for majorization algo¬ 
rithms may not be immediately obvious. When treating inequality constraints in a subsequent 
section, however, the formulas in this section will be useful. 

The Lagrangian equations are 


m 


Vi g(u(y),y) + \(y) v i h j(u(y),y) 

3=1 


o, 


H{u{y),y) = 0. 


Differentiating again gves 



Vng(uj(y),y)Vx(y) + V l2 g(u(y),y)+ 

m 

^j(y){ /D uh j (uj(y),y)Vuj(y) + V 12 h(u(y),y)}+ 

3 =i 

m 

J2 DX j(y) v i h j(u(y),y) = 0, 

3 =i 

and 

V 1 h j (uj(y),y)Vuj(y) + V 2 hj(uj(y),y ) = 0. 

Define 

m 

A = V n g(uj(y),y) + ^ \j(y)V n hj(uj(y),y), 

3 =1 

and 

£ = X>i/i(a;(j/),j/), 

C = V 2 h(u(y),y ), 

m 

D = J2 x j(y) v ^hj(uj(y),y), 

3 =1 

E = V 12 g(u}(y),y), 

F = Vuj(y), 

G = VX(y). 

We must solve 

AF + F'G = -(F + F>), 

FF = -C. 

We have F = -^(F'G + E + F>) and BA-\B'G + E + D) = C. Thus 

G = (FA~ 1 F , )“ 1 {G - BA~\E + F>)}, 

and 

F = -A- l B\BA~ 1 B')- 1 C - {A- 1 - A" 1 F' (BA~ l B')- 1 BA~ l }{E + F>). 

We recover the results for the basic case from the previous section by setting C = 0 and 
D = 0. 
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5 Inequality Constraints, Part I 

If the minimization problem solved in the majorization step involves inequality constraints 
to(y) = argmin {g(x, y) \ H(x) = 0 A K(x) > 0}, 

£ m 

X(y) = {i 6 K"' V hj(x,y ) = 0, V hj(x,y) > 0}. 

j= 1 3=1 +1 

then we cannot expect differentiability any more. 

Of course an inequality constraint like hj(x) > 0 can be rewritten as an equality constraint 
hj(x) — Zj — 0, where the Zj are additional variables. 

Just consider the very simple example 


and 


0(y) = min {(x — y) 2 \ x > 0} 


0 if y > 0, 
y 2 otherwise, 


u(y) = argmin {(x — y) 2 x > 0} 


y if y > o, 

0 otherwise. 


The value function 6 is differentiable at zero, but not twice differentiable. The solution 
function to is continuous, but not differentiable at zero. At zero it does, however, have a left 
derivative equal to zero and a right derivative equal to one. 

Minimizing g(x,y ) = ( x 2 — y 2 ) 2 over x > 0 gives 9 identically equal to zero, and co(y) = \y\, 
continuous but not differentiable at zero. For g(x,y) — (x — y 2 ) 2 and x > 0 we again find 6 
identically equal to zero b but now to(y) = y 2 , which is differentiable. These small examples 
show that in general the solution function will not be differentiable, but may be directionally 
differentiable. 


Convert 
kj(x ) — z 2 — 0 

If our original problem is to minimize / over all x satisfying hj(x) = 0 and pi{x) < 0, then it 
may make sense to majorize both / and the k^. 

If g : M (g) M M majorizes / at y and qt : M <g) M M majorizes k e ll at y , then the 
majorization algorithm is 

x^ k+v> = argmin {g(x, x^ \ hj(x) = 0 A qe(x,y < 0} 

In fact we will look at the more general problem 

u(y) = argmin {g(x,y) \ hj(x,y ) = 0 A p t (x,y) < 0} 

X 

From the Karush-Kuhn-Tucker necessary conditions for this minimization problem 

m r 

Vi g(u(y),y) + J2 X q(y) v i h j(u(y),y) +^2lt{y)V 1 p i {u{y),y) = 0 
j =i i =i 
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hj(u{y),y) = 0, 

Pt(u(y),y) < 0, 

7 e(y) > 0, 

7 e(y)pt(w(y),y) = 0. 

6 Inequality Constraints, Part II 

If the conditions of the previous section are not satisfied we cannot guarantee differentiability, 
and we need to take a step back. First, we need to use the version of the linear convergence 
theorem for fixed point iterations from Sheng and Xu (2001). Then we need to step down a 
level from continuous differentiability to directional differentiability and the Clark Jacobian, 
following Dempe and Pallaschke (1997). 
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